What is the relationship between continent and ‘Energy use (kg of oil equivalent per capita)’?
# Load necessary libraries
library(tidyverse)
library(ggplot2)
library(plotly)
library(shiny)
# Read in the gapminder_clean.csv data as a tibble using read_csv
data <- read_csv("gapminder_clean.csv")
# Remove rows with missing values in the two specific columns
data_Q1 <- data %>%
  drop_na(`Energy use (kg of oil equivalent per capita)`, continent)
ANOVA (Analysis of Variance) is used here because the
continent variable is categorical, and
Energy use (kg of oil equivalent per capita) is
quantitative. ANOVA helps determine if there are statistically
significant differences in the mean energy use between the different
continents by comparing the variance within each continent to the
variance between continents.
# Perform ANOVA on the cleaned data
anova_result_Q1 <- aov(`Energy use (kg of oil equivalent per capita)` ~ continent, data = data_Q1)
# View the summary of the ANOVA result
summary(anova_result_Q1)
##              Df    Sum Sq   Mean Sq F value Pr(>F)
## continent     4 7.715e+08 192870621   51.46 <2e-16 ***
## Residuals   843 3.160e+09   3748033
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Draw a boxplot of energy use by continent and save it as a ggplot object
p1 <- ggplot(data_Q1) +
  aes(
    x = continent,
    y = `Energy use (kg of oil equivalent per capita)`
  ) +
  geom_boxplot(fill = "#4682B4") +
  theme_minimal()
# Convert the ggplot object to an interactive Plotly plot
ggplotly(p1)
Because the p-value of the ANOVA test is below 2e-16 (F = 51.46 on 4 and 843 degrees of freedom), the null hypothesis can be rejected. The null hypothesis here is that mean Energy use (kg of oil equivalent per capita) is the same in every continent group. With a p-value this small, there is strong evidence of statistically significant differences in average energy use between the continents.
In other words, the categorical variable continent is
significantly associated with the quantitative variable
Energy use (kg of oil equivalent per capita): average
energy use differs across continents.
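A significant one-way ANOVA indicates only that at least one continent mean differs from the others; it does not say which pairs of continents differ. One common follow-up is Tukey's HSD test, available as `TukeyHSD()` in base R's stats package. The sketch below is illustrative only: it runs on a small synthetic data frame (`toy`, with a hypothetical `energy_use` column), not on gapminder_clean.csv, and it is not part of the original analysis.

```r
# Synthetic toy data for illustration only (NOT taken from gapminder_clean.csv)
set.seed(1)
toy <- data.frame(
  continent  = factor(rep(c("Africa", "Asia", "Europe"), each = 20)),
  energy_use = c(
    rnorm(20, mean = 700,  sd = 200),
    rnorm(20, mean = 1500, sd = 200),
    rnorm(20, mean = 3500, sd = 200)
  )
)

# Fit the same kind of one-way ANOVA as above
fit <- aov(energy_use ~ continent, data = toy)

# Pairwise comparisons of group means with family-wise error control
tukey <- TukeyHSD(fit)

# One row per pair of continents: difference, confidence bounds, adjusted p-value
print(tukey)
```

Applied to the real analysis, the same call would presumably be `TukeyHSD(anova_result_Q1)`, which yields one row for each pair of continents.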
Is there a significant difference between Europe and Asia with respect to ‘Imports of goods and services (% of GDP)’ in the years after 1990?
In this analysis, ANOVA is used to compare the mean of Imports of
goods and services (% of GDP) between two continents, Europe and
Asia. ANOVA is appropriate here because it tests whether the mean
of a quantitative variable, Imports of goods and services (% of GDP),
differs across the levels of a categorical variable, continent.
# Keep only Europe and Asia, restrict to years after 1990, and drop rows with missing values in the relevant columns
data_Q2 <- data %>%
  filter(continent %in% c("Europe", "Asia")) %>%
  filter(Year > 1990) %>% # Strictly after 1990, so 1990 itself is excluded
  drop_na(`Imports of goods and services (% of GDP)`, continent)
# Perform ANOVA to analyze the effect of continent on Imports of goods and services (% of GDP)
anova_result_Q2 <- aov(`Imports of goods and services (% of GDP)` ~ continent, data = data_Q2)
# View the summary of the ANOVA result
summary(anova_result_Q2)
##              Df Sum Sq Mean Sq F value Pr(>F)
## continent     1   1347  1347.2   2.012  0.158
## Residuals   210 140594   669.5
# Draw a boxplot of imports (% of GDP) by continent and save it as a ggplot object
p2 <- ggplot(data_Q2) +
  aes(
    x = continent,
    y = `Imports of goods and services (% of GDP)`
  ) +
  geom_boxplot(fill = "#4682B4") +
  theme_minimal()
# Convert the ggplot object to an interactive Plotly plot
ggplotly(p2)
The p-value of the ANOVA test is 0.158, which is greater than the
conventional significance level of 0.05. Therefore, we fail to reject the
null hypothesis that the mean Imports of goods and services (% of GDP) is
the same for Europe and Asia in the years after 1990.
The boxplot, which displays medians and quartiles rather than means, shows broadly similar distributions of Imports of goods and services (% of GDP) for Europe and Asia. The group means are also close: 39.79% of GDP for Asia and 37.79% for Europe.
In conclusion, the large p-value suggests that there is no statistically significant difference in imports of goods and services between these two continents for the specified period.
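Because only two groups are compared here, a one-way ANOVA is equivalent to a two-sample t-test with pooled variance: the F statistic is the square of the t statistic and the p-values coincide. The sketch below checks this on synthetic data (the `toy` data frame and its `imports` column are hypothetical, not taken from gapminder_clean.csv). It also computes Welch's t-test, R's default, which does not assume equal variances.

```r
# Synthetic toy data for illustration only (NOT taken from gapminder_clean.csv)
set.seed(42)
toy <- data.frame(
  continent = rep(c("Asia", "Europe"), each = 30),
  imports   = c(rnorm(30, mean = 40, sd = 25), rnorm(30, mean = 38, sd = 25))
)

# One-way ANOVA p-value for the continent term
anova_p <- summary(aov(imports ~ continent, data = toy))[[1]][["Pr(>F)"]][1]

# Student's t-test with pooled variance (var.equal = TRUE)
ttest_p <- t.test(imports ~ continent, data = toy, var.equal = TRUE)$p.value

# The two p-values agree up to floating-point tolerance
print(all.equal(anova_p, ttest_p))

# Welch's t-test drops the equal-variance assumption and may give a slightly different p-value
welch_p <- t.test(imports ~ continent, data = toy)$p.value
```

If the spread of imports differs noticeably between the two continents, the Welch version may be the safer choice.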
What is the country (or countries) that has the highest ‘Population density (people per sq. km of land area)’ across all years? (i.e., which country has the highest average ranking in this category across each time point in the dataset?)
# First, compute the mean population density for each country in each year
avg_density_by_country_year <- data %>%
  group_by(`Country Name`, Year) %>%
  summarize(
    avg_density = mean(`Population density (people per sq. km of land area)`, na.rm = TRUE),
    .groups = "drop" # Return an ungrouped tibble and silence the grouping message
  )
# Calculate the overall average population density for each country across all years
avg_density_by_country <- avg_density_by_country_year %>%
  group_by(`Country Name`) %>%
  summarize(avg_density = mean(avg_density))
# Identify the country with the highest average population density
# (which.max() returns only the first maximum, so ties would not be reported)
avg_density_by_country$`Country Name`[which.max(avg_density_by_country$avg_density)]
## [1] "Macao SAR, China"
# Plot a bar chart on the population density (people per sq. km of land area) across all years with the respective country name, and save it as a ggplot object
p3 <- ggplot(avg_density_by_country) +
  aes(x = `Country Name`, y = avg_density) +
  geom_col(fill = "#112446") +
  labs(
    x = "Country Names",
    y = "Overall average population density for each country across all years (people per sq. km of land area)"
  ) +
  theme_minimal()
# Convert the ggplot object to an interactive Plotly plot
ggplotly(p3)
Both the printed result and the bar chart show that Macao SAR, China has the highest average population density (people per sq. km of land area) across all years.
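The question's parenthetical asks for the highest average ranking at each time point, which is not quite the same as the highest average density computed above. A minimal sketch of a rank-based variant follows, using dplyr on a small hypothetical data frame (`toy`, with made-up `country`, `year`, and `density` columns rather than the gapminder values). Whether it would change the answer for the real data has not been checked here.

```r
library(dplyr)

# Hypothetical toy data for illustration only (not the gapminder values)
toy <- data.frame(
  country = rep(c("A", "B", "C"), times = 3),
  year    = rep(c(1962, 1967, 1972), each = 3),
  density = c(100,  90, 10,
               95, 120, 12,
              110, 105, 15)
)

avg_rank <- toy %>%
  filter(!is.na(density)) %>%
  group_by(year) %>%
  mutate(rank_in_year = min_rank(desc(density))) %>% # 1 = densest country in that year
  group_by(country) %>%
  summarize(mean_rank = mean(rank_in_year), .groups = "drop") %>%
  arrange(mean_rank)

# Keep every country tied for the best (lowest) mean rank
top <- avg_rank %>%
  filter(mean_rank == min(mean_rank))

print(top)
```

Filtering on the minimum mean rank, rather than calling `which.max()`, keeps all tied countries, which matches the "country (or countries)" wording of the question.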
What country (or countries) has shown the greatest increase in ‘Life expectancy at birth, total (years)’ between 1962 and 2007?
# Filter data for the years 1962 and 2007
data_Q4 <- data %>%
  filter(Year %in% c(1962, 2007)) %>%
  select(`Life expectancy at birth, total (years)`, Year, `Country Name`) %>%
  drop_na(`Life expectancy at birth, total (years)`, Year)
# Calculate the change in life expectancy from 1962 to 2007 for each country
change_in_life_expectancy <- data_Q4 %>%
  group_by(`Country Name`) %>%
  summarize(
    Changes = `Life expectancy at birth, total (years)`[Year == 2007] -
      `Life expectancy at birth, total (years)`[Year == 1962]
  ) %>%
  arrange(desc(Changes)) # Arrange countries by highest increase
# Identify the country with the greatest increase in life expectancy
change_in_life_expectancy$`Country Name`[which.max(change_in_life_expectancy$Changes)]
## [1] "Maldives"
# Plot a bar chart on the increase in 'Life expectancy at birth, total (years)' with the respective country name, and save it as a ggplot object
p4 <- ggplot(change_in_life_expectancy) +
  aes(x = `Country Name`, y = Changes) +
  geom_col(fill = "#112446") +
  labs(
    y = "Differences of life expectancy at birth from 1962 to 2007"
  ) +
  theme_minimal()
# Convert the ggplot object to an interactive Plotly plot
ggplotly(p4)
Both the printed result and the bar chart show that Maldives had the greatest increase in ‘Life expectancy at birth, total (years)’ between 1962 and 2007.
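The change computation above relies on every country having a value for both 1962 and 2007. An alternative sketch, assuming tidyr's `pivot_wider()` and dplyr's `slice_max()`, makes missing endpoints explicit and keeps ties. The `toy` data frame and its `country`, `year`, and `life_exp` columns are hypothetical and are not the gapminder values.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data for illustration only; country "C" has no 1962 value
toy <- data.frame(
  country  = c("A", "A", "B", "B", "C"),
  year     = c(1962, 2007, 1962, 2007, 2007),
  life_exp = c(40, 70, 60, 75, 65)
)

change <- toy %>%
  filter(year %in% c(1962, 2007)) %>%
  # One row per country, with columns y1962 and y2007
  pivot_wider(names_from = year, values_from = life_exp, names_prefix = "y") %>%
  mutate(change = y2007 - y1962) %>%
  # Drop countries missing either endpoint
  filter(!is.na(change)) %>%
  # Keep the largest increase, including any ties
  slice_max(change, n = 1, with_ties = TRUE)

print(change)
```

In this toy example country "C" is dropped because it has no 1962 value, and "A" is returned with an increase of 30 years.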